Image classification and reconstruction from low-density EEG

Recent advances in visual decoding have enabled the classification and reconstruction of perceived images from the brain. However, previous approaches have predominantly relied on stationary, costly equipment like fMRI or high-density EEG, limiting the real-world availability and applicability of such projects. Additionally, several EEG-based paradigms have utilized artifactual rather than stimulus-related information, yielding flawed classification and reconstruction results. Our goal was to reduce the cost of the decoding paradigm while increasing its flexibility. Therefore, we investigated whether the classification of an image category and the reconstruction of the image itself are possible from the visually evoked brain activity measured by a portable, 8-channel EEG. To compensate for the low electrode count and to avoid flawed predictions, we designed a theory-guided EEG setup and created a new experiment to obtain a dataset from 9 subjects. We compared five contemporary classification models with our setup, reaching an average accuracy of 34.4% for 20 image classes on hold-out test recordings. For the reconstruction, the top-performing model was used as an EEG encoder, which was combined with a pretrained latent diffusion model via double-conditioning. After fine-tuning, we reconstructed images from the test set with a 1000-trial, 50-class top-1 accuracy of 35.3%. While not reaching the same performance as MRI-based paradigms on unseen stimuli, our approach greatly improved the affordability and mobility of the visual decoding technology.


Preprocessing: Bad trial rejection
For the bad trial rejection, we first filtered the data, as we aimed to assess the trials solely based on their SNR in the frequency regions of interest. Then, we segmented the signals into trials and discarded the parts recorded before and after the experiment, as well as the sections overlapping with the gray screen displays in between images. Each trial was successively checked for non-numeric (NaN) and flat values. A channel was considered flat if either the median absolute deviation from the median or the standard deviation of its samples was below a predefined threshold (1e-9 microvolts). Whereas NaN values indicated connection problems between the EEG and the experiment computer, flat channels implied a loss of contact between an electrode and the scalp. Both trial rejection methods were adapted from an existing preprocessing pipeline 5.
Additionally, we applied a trial rejection based on correlations between the electrodes. EEG signals measured from the scalp commonly show a high correlation, especially if the electrodes are located in close proximity 6. Therefore, we checked whether the electrode correlations in a trial were in a reasonable range. If one or more of the channels showed a low absolute correlation (<0.3) with the other channels, this could indicate a loss of contact with the scalp. On the other hand, if a channel correlated perfectly with another electrode, this might signify an unwanted connection between channels, for example because the conductive gel bridged multiple electrodes. The choice of the upper correlation bound differed based on whether the recordings were done with or without gel and varied per subject. We plotted the rejected trials for different bounds for every participant and decided for each bound, by visual inspection, whether the rejection was reasonable. All chosen upper bounds were in the range of 0.95 to 0.98. The output of the bad trial rejection was a mask for each recording that marked channels as bad for later exclusion.
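The three checks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, a trial is taken to be a NumPy array of shape (channels, samples), and the correlation rule is interpreted as comparing each channel's strongest absolute correlation with any other channel against the two bounds.

```python
import numpy as np

FLAT_THRESHOLD_UV = 1e-9  # flatness threshold in microvolts (value from the text)
LOW_CORR = 0.3            # lower bound on absolute correlation (value from the text)


def is_bad_trial(trial, upper_corr=0.97):
    """Return True if a (n_channels, n_samples) trial should be rejected.

    upper_corr is a per-subject choice; the text reports 0.95 to 0.98.
    """
    # 1) NaN values point to a connection problem with the recording computer.
    if np.isnan(trial).any():
        return True

    # 2) A channel is flat if its MAD or its standard deviation is near zero.
    median = np.median(trial, axis=1, keepdims=True)
    mad = np.median(np.abs(trial - median), axis=1)
    std = trial.std(axis=1)
    if np.any((mad < FLAT_THRESHOLD_UV) | (std < FLAT_THRESHOLD_UV)):
        return True

    # 3) Correlation check (assumed rule): look at each channel's strongest
    #    absolute correlation with any other channel.
    corr = np.abs(np.corrcoef(trial))
    np.fill_diagonal(corr, np.nan)  # ignore self-correlation
    strongest = np.nanmax(corr, axis=1)
    too_low = np.any(strongest < LOW_CORR)     # possible loss of scalp contact
    too_high = np.any(strongest > upper_corr)  # possible gel bridge
    return bool(too_low or too_high)


def rejection_mask(trials, upper_corr=0.97):
    """Boolean mask over trials; True marks a trial for exclusion."""
    return np.array([is_bad_trial(t, upper_corr) for t in trials])
```

Ordering the checks this way means that the correlation matrix is only computed for trials without NaN or constant channels, which avoids undefined correlation values.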

Classification: Model-specific adaptations
Here we present the model-specific modifications that ensured the seamless integration with our framework and allowed for exploration of optimal hyperparameter combinations. As a general adaptation, we changed the channel-input size to work with our 8-electrode setup.

EEGNet
For the hyperparameter search, we varied the number of temporal filters (F1) in the first convolutional block, but kept the number of spatial filters (D = 2) per temporal filter constant, similar to Lawhern et al. 7. However, we explored a larger search space for the number of temporal filters compared to the original study to explore more complex architectures. The number of pointwise convolutions in the second convolutional block was set to F2 = F1 * D, as proposed by Lawhern et al. 7, who used the same notation. Moreover, we experimented with different Dropout probabilities to study the impact of regularization on the model's ability to generalize. For that, we adopted the probability values that the authors used for cross-subject (0.25) and within-subject classifications (0.5) as the search space. Additionally, we alternated between average and max pooling to gauge the effect of selecting the most prominent feature in the intermediate inputs or an average of the features. The other hyperparameters, like kernel dimensions for the convolutional and pooling layers, were adopted from Lawhern et al. 7.
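To make the coupling between the filter counts concrete, the following sketch enumerates a small EEGNet search space. The class name and the F1 candidate values are hypothetical placeholders (the actual search space is given in Supplementary Table 2); the Dropout values, the pooling options, D = 2 and the relation F2 = F1 * D are taken from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EEGNetConfig:
    f1: int         # temporal filters in the first convolutional block
    dropout: float  # Dropout probability
    pooling: str    # "avg" or "max"
    d: int = 2      # spatial filters per temporal filter, kept constant

    @property
    def f2(self) -> int:
        # Pointwise filters in the second block follow F2 = F1 * D.
        return self.f1 * self.d


F1_CANDIDATES = [8, 16, 32, 64]     # hypothetical values for illustration only
DROPOUT_CANDIDATES = [0.25, 0.5]    # cross-subject and within-subject values
POOLING_CANDIDATES = ["avg", "max"]

search_space = [
    EEGNetConfig(f1, dropout, pooling)
    for f1, dropout, pooling in product(
        F1_CANDIDATES, DROPOUT_CANDIDATES, POOLING_CANDIDATES
    )
]
```

Because F2 is derived rather than sampled, only F1, the Dropout probability and the pooling type act as free hyperparameters in this sketch.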

TSCeption
For TSCeption, we experimented with the depth of the temporal (T) and spatial (S) convolutions to vary the model's complexity. We further varied the ratio coefficients that determined the different kernel dimensions of the temporal convolutions to alter the temporal patterns that could be extracted. Ding et al. 8 only employed ratio coefficients of [0.5, 0.25, 0.125] relative to the sampling rate, to capture frequencies of 2 Hz and above, 4 Hz and above, and 8 Hz and above, respectively. As previous studies have emphasized the relevance of gamma frequencies for visual classification 9, we wanted to explore distinct filters for higher frequencies as well. In contrast to the approach by Ding et al. 8, which divided the local spatial filters by hemisphere, we applied them separately to the four inferiorly located and the four superiorly located EEG channels over the occipital and parieto-occipital lobe. The strategy was to isolate activity originating from the primary visual cortex through the inferiorly located electrodes and to discern activity from higher visual areas using the more superiorly located channels. Furthermore, we explored different Dropout probabilities to gauge the regularization effect on the model performance.
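The link between a ratio coefficient and the lowest frequency it targets is simple arithmetic: a kernel that spans a fraction r of one second contains exactly one full cycle of a 1/r Hz oscillation. The sketch below illustrates this under assumptions: the sampling rate of 250 Hz and the additional ratio 0.0625 are hypothetical, and truncating the kernel length to an integer is an assumed convention.

```python
def temporal_kernel_length(ratio, sampling_rate_hz):
    """Kernel length in samples for a ratio coefficient (truncated, assumed)."""
    return int(ratio * sampling_rate_hz)


def lowest_full_cycle_hz(ratio):
    """Slowest oscillation that completes one full cycle within the kernel."""
    return 1.0 / ratio


SAMPLING_RATE_HZ = 250                    # hypothetical sampling rate
RATIOS = [0.5, 0.25, 0.125, 0.0625]       # the last value is a hypothetical addition

for ratio in RATIOS:
    length = temporal_kernel_length(ratio, SAMPLING_RATE_HZ)
    print(f"ratio={ratio}: {length} samples, >= {lowest_full_cycle_hz(ratio):g} Hz")
```

Under these assumptions, the three coefficients from Ding et al. map to 2, 4 and 8 Hz, and halving the smallest coefficient again would shift the lower limit to 16 Hz.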

EEG-ChannelNet
Compared to the original implementation of EEG-ChannelNet by Palazzo et al. 10, we varied the number and depth of the convolutions in the temporal (T) and spatial (S) blocks. With regard to the spatial block, we had to reduce the number of filters, as the model was developed to work with a spatial dimension of 128 and applied four spatial convolution operations with kernel sizes of (128, 1), (64, 1), (32, 1) and (16, 1), respectively. For our setup with only 8 electrodes, we adapted the number of spatial filters as well as the kernel sizes. Thus, we either used two spatial convolutions of (8, 1) and (4, 1), or we added a third with a kernel dimension of (2, 1). Additionally, we adapted the padding employed for the spatial convolutions to (4, 0), (2, 0), and (1, 0) to fit the aforementioned spatial kernels. Furthermore, to limit the computational complexity of the model, we reduced the number of residual blocks from 4 to 2 and modified the classifier head from Palazzo et al. 10. The initial classifier head included a hidden layer, which we replaced with a direct mapping to the output. As with the previous models, we experimented with different Dropout probabilities.
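The pairing of kernel sizes and paddings can be checked with the standard convolution output-size formula. The sketch below is only an illustration: the stride is not stated in the text, so it is treated as a free parameter, and the helper name is hypothetical.

```python
def conv_out_size(n_in, kernel, padding, stride=1):
    """Output length of a convolution along one axis (dilation of 1)."""
    return (n_in + 2 * padding - kernel) // stride + 1


N_ELECTRODES = 8
# (kernel height, padding) pairs along the electrode axis, as given in the text
SPATIAL_BRANCHES = [(8, 4), (4, 2), (2, 1)]

for stride in (1, 2):  # the stride is an assumption, so two values are shown
    heights = [
        conv_out_size(N_ELECTRODES, kernel, padding, stride)
        for kernel, padding in SPATIAL_BRANCHES
    ]
    print(f"stride={stride}: output heights {heights}")
```

For both strides, the three branches yield the same output height, which is presumably what allows their outputs to be combined in the spatial block.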

EEG Conformer
In its original form, the depth of the convolutional operations and the embedding size of the EEG Conformer were set to the same value (40) 11. We decided to study the effect of the convolution module depth and the embedding size separately. Therefore, we varied the depth of the temporal and spatial convolution and employed the 1x1 convolution to map to a differing embedding size. Additionally, the spatial kernel size was changed from (22, 1) to (8, 1) to fit our channel count. As Song et al. 11 have shown the insensitivity of the Conformer to the self-attention depth and the number of heads on two different EEG datasets, we adopted their settings of the attention block. However, we varied the pooling kernel size of the preceding convolutional block, which determined one of the dimensions of the input for the attention process. In their study, Song et al. 11 noted that varying the kernel sizes (m) between 15 and 45, in increments of 10, significantly affected the classification accuracy. Therefore, we defined the search space for the pooling kernel in that range. Notably, for the EEG Conformer, we did not use 30 epochs as for the other models, as the transformer requires more training time than purely CNN-based models 11. Instead, we checked for super-convergence using 50 epochs, but also allowed 200 epochs. The latter value was derived from the original paper, which showed convergence between 200 and 250 epochs. All other model-specific hyperparameters were adopted from Song et al. 11.

EEG-to-Image-based model
Different from the previous approaches, the EEG-to-Image-based algorithm consisted of a pretrained deep learning model employed for feature extraction and a machine learning classifier for the actual classification. Due to the different input dimensions, we had to adapt the EEG-to-Image transformation process. Mishra et al. 12 vertically stacked four copies of the transformed channel data and then vertically combined the stacked copies from each channel before resizing to (224, 224, 3). Instead, we vertically packed 28 copies per channel, mapping to a dimension of 224 (8 channels x 28 copies), and then resized the temporal dimension from 500 to 224, to obtain a similar outcome with fewer input channels. Mishra et al. 12 reported the best performance using the pre-trained EfficientNet model 13 for feature extraction and an SVM as the classifier. Therefore, we only adopted that approach from the original study. However, we additionally performed a hyperparameter search for the machine learning classifier. The SVM was tested with different kernels and various values for the C regularization parameter and the gamma kernel coefficient. The C and gamma hyperparameters were sampled from a log-uniform distribution returning values between e^-10 and e^10, as previous researchers have recommended using exponentially growing search spaces 14. The hyperparameter names were adopted from the Scikit-Learn framework 15.
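The stacking arithmetic can be illustrated with a short NumPy sketch. Several details are assumptions rather than the original implementation: the function name is hypothetical, the 8-bit conversion uses per-trial min-max scaling, the temporal resize is done by nearest-neighbour column sampling instead of an image library, and the three colour channels are filled with copies of the grayscale image.

```python
import numpy as np


def eeg_to_image(trial, copies_per_channel=28, width=224):
    """Map a (8, 500) trial to a (224, 224, 3) uint8 image.

    Assumes the trial is not constant, so min-max scaling is well defined.
    """
    # 8-bit grayscale conversion (per-trial min-max scaling, assumed)
    lo, hi = trial.min(), trial.max()
    gray = np.round(255 * (trial - lo) / (hi - lo)).astype(np.uint8)

    # Repeat every channel row 28 times: 8 channels x 28 copies = 224 rows
    stacked = np.repeat(gray, copies_per_channel, axis=0)

    # Shrink the temporal axis (500 -> 224) by nearest-neighbour sampling
    columns = np.linspace(0, stacked.shape[1] - 1, width).round().astype(int)
    resized = stacked[:, columns]

    # Copy the grayscale image into three colour channels
    return np.stack([resized] * 3, axis=-1)


rng = np.random.default_rng(0)
example_trial = rng.standard_normal((8, 500))  # synthetic data for illustration
image = eeg_to_image(example_trial)
print(image.shape, image.dtype)
```

Each block of 28 identical rows in the resulting image corresponds to one electrode, which explains the striped, spectrogram-like appearance of the transformed trials.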

Hyperparameter Search
To test different hyperparameter combinations for each of the five models, we implemented a random search using Weights and Biases. The search space for each model can be found in Supplementary Table 2.
For each model, we first conducted 100 runs with randomly sampled hyperparameters from the predefined search space. Then, we inspected the validation loss for the respective combinations to refine the search space, before carrying out an additional 50 runs to find the best hyperparameter set. As the optimal hyperparameter combination may differ depending on the dataset, we conducted the hyperparameter search for each subject individually. We did not have specific expectations on how the models with their respective hyperparameter combinations would perform, since we tried them on a completely new dataset. Additionally, except for EEGNet, most models have either only been tested on a flawed dataset (EEG-ChannelNet and EEG-to-Image) or been used in non-stationary EEG tasks (TSCeption and Conformer).
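The two-stage procedure can be sketched in plain Python, using the SVM hyperparameters of the EEG-to-Image model as an example. This is an illustration under assumptions and does not use the Weights and Biases API: the function names, the kernel list, the toy objective standing in for the validation loss, and the narrowed exponent range of the second stage are all hypothetical, and the natural base of the log-uniform distribution is assumed.

```python
import math
import random

KERNELS = ["linear", "poly", "rbf", "sigmoid"]  # assumed kernel candidates


def sample_log_uniform(rng, low_exp, high_exp):
    """Draw a value whose natural logarithm is uniform on [low_exp, high_exp]."""
    return math.exp(rng.uniform(low_exp, high_exp))


def sample_svm_config(rng, exp_range):
    low_exp, high_exp = exp_range
    return {
        "kernel": rng.choice(KERNELS),
        "C": sample_log_uniform(rng, low_exp, high_exp),
        "gamma": sample_log_uniform(rng, low_exp, high_exp),
    }


def random_search(evaluate, n_runs, rng, exp_range):
    """Run n_runs random configurations; return (loss, config) sorted by loss."""
    runs = []
    for _ in range(n_runs):
        config = sample_svm_config(rng, exp_range)
        runs.append((evaluate(config), config))
    return sorted(runs, key=lambda run: run[0])


def toy_validation_loss(config):
    # Stand-in objective; a real run would train the model and validate it.
    return abs(math.log(config["C"]) - 1.0) + abs(math.log(config["gamma"]) + 2.0)


rng = random.Random(0)
stage_one = random_search(toy_validation_loss, 100, rng, exp_range=(-10, 10))
# In the study, the search space was refined by inspecting the validation loss;
# here a narrower range is simply chosen by hand for illustration.
stage_two = random_search(toy_validation_loss, 50, rng, exp_range=(-4, 4))
best_loss, best_config = min(stage_one[0], stage_two[0], key=lambda run: run[0])
```

Sampling the exponent uniformly gives every order of magnitude the same chance of being drawn, which is the motivation for exponentially growing search spaces.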

Why the EEG-to-Image framework might perform at chance level
The EEG-to-Image classification paradigm remained around chance-level performance in our analysis. As discussed before, the high accuracy reported in Mishra et al. 12 has likely arisen from predictions based on block-level temporal correlations originating from the EEG hardware.
Additionally, the pronounced difference between our EEG-to-Image classification performance and the results in Mishra et al. 12 might partly be due to the different hardware setup and especially the lower channel count. As mentioned, because of the lower electrode count, we had to increase the number of times each channel was stacked vertically. However, we further question the effectiveness of the method for the following reason. As the authors have not specified which layer of the pre-trained image model they used to extract features from the transformed EEG images, we assumed it would be the pre-classification layer. This assumption was reasonable, as this layer had the smallest dimension of all layers (1280) and no additional dimensionality reduction methods were mentioned to mitigate the curse of dimensionality. However, in this layer, the pre-trained network has already extracted high-level information to classify a real image as belonging to a certain category of the ImageNet dataset.
Notably, the 8-bit grayscale transformation of the EEG yielded a spectrogram-like output that was more abstract than the images in ImageNet. In terms of assigning a category to the abstract input, we consider it unlikely that the small nuances in the transformed EEG images would yield drastically different output node activations. Therefore, the SVM received non-informative feature vectors, leading to random predictions. Potentially, selecting an earlier layer from the pre-trained model that would extract low-level features and applying a dimensionality reduction technique before feeding the output to an SVM could be more successful.